This notebook is a prototype of the full pipeline used in this project. It provides an example of the entire pipeline, which is currently spread over multiple notebooks, covering every step, including:
There are three files that are generated at this stage. All consist of feature vectors corresponding to protein pairs in the following three sets:
This pipeline will take advantage of a prepared list of interactions to build the network rather than trying to classify all possible combinations of the pulldown proteins.
To run the assembler, a data source table must first be generated. This table describes the data sources to be used to build feature vectors. In this case, we will be using simple features with high coverage to avoid dealing with the missing data problem. This task is also performed in the notebook on training and test set generation.
This table must be placed in the top directory of the feature data:
In [9]:
cd ../..
In [5]:
import csv
In [4]:
!git annex unlock datasource.proto.tab
In [5]:
f = open("datasource.proto.tab","w")
c = csv.writer(f,delimiter="\t")
# the HIPPIE feature
c.writerow(["HIPPIE/hippie_current.txt","HIPPIE/feature.HIPPIE.db","protindexes=(1,3);valindexes=(4);zeromissing=1"])
# Gene Ontology features
c.writerow(["Gene_Ontology","Gene_Ontology","generator=geneontology/testgen.pickle"])
# iRefIndex features
c.writerow(["iRefIndex","iRefIndex","generator=iRefIndex/human.iRefIndex.Entrez.1ofk.pickle"])
# STRING features
c.writerow(["STRING","STRING","generator=string/human.Entrez.string.pickle"])
# ENTS summary feature
c.writerow(['ENTS_summary','ENTS_summary','generator=ents/human.Entrez.ENTS.summary.pickle'])
f.close()
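As a quick sanity check (just a sketch, reading back the file written above), the generated table can be printed to confirm that each row lists a source file, a feature database path and an options string:
In [ ]:
# sketch: read the data source table back and print each row
with open("datasource.proto.tab") as f:
    for row in csv.reader(f, delimiter="\t"):
        print row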
In [4]:
import sys
In [7]:
sys.path.append("opencast-bio/")
In [8]:
import ocbio.extract
In [13]:
reload(ocbio.extract)
Out[13]:
In [10]:
!git annex unlock HIPPIE/feature.HIPPIE.db
In [14]:
assembler = ocbio.extract.FeatureVectorAssembler("datasource.proto.tab", verbose=True)
At this point the assembler.regenerate method could be used to make sure the database is up to date, but since there is only one database and it is known to be up to date, this is unnecessary.
Next, we use the assembler.assemble method to create the feature vectors, supplying it with a list of Entrez Gene ID pairs to match to these vectors:
In [20]:
assembler.assemble("DIP/human/training.nolabel.positive.Entrez.txt",
"features/training.nolabel.proto.positive.Entrez.vectors.txt", verbose=True)
In [15]:
assembler.assemble("DIP/human/training.nolabel.negative.Entrez.txt",
"features/training.nolabel.proto.negative.Entrez.vectors.txt", verbose=True)
In [16]:
assembler.assemble("forGAVIN/mergecode/edgelist_new.txt",
"features/edgelist.proto.vectors.txt", verbose=True)
Now that the feature vectors have been written, we will use the training vectors to train a classifier which we can apply to the vectors written to edgelist.proto.vectors.txt.
We will be using an ensemble classifier to slightly improve performance over a simple classifier such as logistic regression.
The chosen classifier, as stated in the proposal, is a random forest, as it was shown to have superior performance in Qi's 2006 paper.
The task of protein interaction prediction as approached here is a supervised binary classification problem. What do those words actually mean?
A random forest classifier is a combination of a large number of generated decision trees. Decision trees are intuitive to understand; the name practically describes what they are. At each node in the tree one of the features is tested and a decision is made about which feature to check next or which output to provide:
Random forests simply generate a large number of these trees with varying structures and then combine the results.
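As a rough illustration of this idea (a sketch on synthetic toy data, not the project data; the forest below is hand-rolled purely to show the bootstrap-and-vote principle), each tree is fitted to a bootstrap sample of the rows and the per-tree votes are averaged:
In [ ]:
# sketch: a hand-rolled "forest" of decision trees on toy data
import numpy as np
import sklearn.tree

# toy dataset: 100 examples, 5 features, label depends on the first two features
Xtoy = np.random.rand(100, 5)
ytoy = (Xtoy[:, 0] + Xtoy[:, 1] > 1.0).astype(int)

trees = []
for i in range(10):
    # bootstrap sample of the rows; random subset of features considered per split
    sample = np.random.randint(0, 100, 100)
    tree = sklearn.tree.DecisionTreeClassifier(max_features="sqrt")
    trees.append(tree.fit(Xtoy[sample], ytoy[sample]))

# average the votes of the individual trees to get the ensemble prediction
votes = np.mean([tree.predict(Xtoy) for tree in trees], axis=0)
print votes[:10]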
The training data is a sparse dataset in that it contains a very large number of zeros.
This can be efficiently represented using a SciPy sparse matrix.
The CSV files can be loaded using Python's csv module:
In [17]:
import csv
import numpy as np
import scipy.sparse
At the moment we have to read off the numbers of rows and columns from the assembler output above to initialise these matrices.
In [80]:
#initialising matrices
posrows,poscolumns = 4857,111
negrows,negcolumns = 3061800,111
posvectors = scipy.sparse.lil_matrix((posrows,poscolumns))
negvectors = scipy.sparse.lil_matrix((negrows,negcolumns))
In [81]:
# loading positive vectors
f = open("features/training.nolabel.proto.positive.Entrez.vectors.txt")
c = csv.reader(f, delimiter="\t")
lcount = 0
for line in c:
    posvectors[lcount, :] = np.array(line, dtype=float)
    lcount += 1
f.close()
In [82]:
# loading negative vectors
f = open("features/training.nolabel.proto.negative.Entrez.vectors.txt")
c = csv.reader(f, delimiter="\t")
lcount = 0
for line in c:
    negvectors[lcount, :] = np.array(line, dtype=float)
    lcount += 1
f.close()
At this point there are two large matrices, posvectors and negvectors. As the names suggest, they are the positive and negative feature vectors.
Scikit-learn requires binary labels for these vectors. To provide them, we create a vector of labels with the same length as these two matrices combined.
These are then shuffled to ensure that the training and test sets both contain a random mixture of positive and negative examples.
This task is also shown in the notebook on training the classifier.
When training a classifier there is usually a distinction between the training and test sets. The difference between them is that the training set is actually used to fit the model. To detect overfitting we evaluate this fitted model on a randomly selected test set, as testing the model on the training set will typically produce unrealistically good results. This will be demonstrated below.
In [85]:
# making the label vector
ylabels = np.array([1.0]*posvectors.shape[0] + [0.0]*negvectors.shape[0])
In [86]:
# shuffling is done using scikit-learn's shuffle
import sklearn.utils
In [88]:
# concatenate the data together and shuffle
X,y = sklearn.utils.shuffle(scipy.sparse.vstack([posvectors,negvectors]),ylabels)
In [90]:
# find the quarter mark of the dataset
quarter = X.shape[0] // 4
# split into training (3/4) and test (1/4) sets
Xtrain, Xtest = X[:3*quarter], X[3*quarter:]
ytrain, ytest = y[:3*quarter], y[3*quarter:]
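As a quick check (a sketch using the arrays defined above), the shuffle should leave the training and test sets with similar proportions of positive examples:
In [ ]:
# sketch: fraction of positive labels in each split should be roughly equal
print "train positive fraction: {0:.4f}".format(ytrain.mean())
print "test positive fraction:  {0:.4f}".format(ytest.mean())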
The bulk of this work is done inside Scikit-learn's random forest classifier.
All classifiers in Scikit-learn have a similar interface.
First, the classifier is instantiated with the relevant options. In this case that is n_estimators, which is the number of trees to use in the forest.
Second, the fit method is used to fit the model. The training set supplies the X and y arguments.
In [91]:
import sklearn.ensemble
In [93]:
randomforest = sklearn.ensemble.RandomForestClassifier(n_estimators=10)
In [108]:
randomforest.fit(Xtrain.todense(),ytrain)
Out[108]:
After we have fit the model, typically we want to know whether our model fits the data well or not. In the task of classification we would like to get an idea of the error rate and accuracy we can expect when we apply this model to our problem. A typical tool to quickly describe the sensitivity and specificity of a binary classifier is a ROC curve.
ROC stands for Receiver Operating Characteristic. It is a plot of the true positive rate against the false positive rate for a classifier as the classification threshold is varied. The Wikipedia page describes it in more detail if required.
We can build ROC curves running the classifier on both the test data and training data to illustrate the overfitting problem mentioned above. First, running the classifier on the training data should give us inflated results:
In [96]:
import sklearn.metrics
import matplotlib.pyplot as plt
In [120]:
def plotroc(ytest, estimates):
    # ROC curve from the positive-class probability estimates
    fpr, tpr, thresholds = sklearn.metrics.roc_curve(ytest, estimates[:, 1])
    plt.clf()
    plt.plot(fpr, tpr)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.show()
    print "Area under curve (AUC): {0}".format(sklearn.metrics.auc(fpr, tpr))
    return fpr, tpr
In [117]:
estimates = randomforest.predict_proba(Xtrain.todense())
In [123]:
trfpr,trtpr = plotroc(ytrain,estimates)
In [124]:
estimates = randomforest.predict_proba(Xtest.todense())
In [125]:
tefpr,tetpr = plotroc(ytest,estimates)
The difference in this case is actually very small, and this is likely a characteristic of the data used to train the model.
Cross validation involves taking multiple splits of the training and test data and evaluating the AUC on the test data of each. In the real pipeline this will be the preferred method of testing to avoid outlier test sets which predict unrealistic performance.
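A minimal sketch of how that could look using scikit-learn's cross_val_score, scoring each fold by AUC (the module is sklearn.cross_validation in the version used here; newer releases call it sklearn.model_selection):
In [ ]:
# sketch: 5-fold cross validation of the random forest, scored by ROC AUC
import sklearn.cross_validation  # sklearn.model_selection in newer versions

scores = sklearn.cross_validation.cross_val_score(
    sklearn.ensemble.RandomForestClassifier(n_estimators=10),
    X.todense(), y, scoring="roc_auc", cv=5)
print "AUC per fold: {0}".format(scores)
print "mean AUC: {0:.3f}".format(scores.mean())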
AUC values give a single-value figure of merit for a ROC curve. The closer the value is to 1, the closer the classifier is to being a perfect predictor for the data.
The training labels in this task are very sparsely populated, with roughly 600 negative examples for every positive one. On this kind of data, and as mentioned by Qi et al., it is more common to use a reduced AUC value, which only considers the first portion of the ROC curve. This is because in the later sections of the curve the false positive rate of the predictor is rising, and in a practical setting only low false positive rates are acceptable.
A value described in Qi et al (2006) is the R50 value. This is described as the area under the curve up until 50 false positives have been detected. In their case this corresponds to a false positive rate of 0.167%. We have 766,665 test examples, so our R50 value would correspond to a rate of 0.000652%, which is clearly impractical.
The description of this in the paper is likely poorly worded. It also claims that this value is in widespread use, but apart from in this paper it is very difficult to find any mention of it. In any case, we can find the area under the ROC curve up until a false positive rate of 0.167% is reached.
In [171]:
r50trfpr = trfpr[trfpr < 0.00167]
r50crv = np.array([r50trfpr, trtpr[:r50trfpr.shape[0]]])
print "R50 value for Training data {0}".format(sum(r50crv[1,:])/len(r50crv[1,:]))
In [174]:
r50tefpr = tefpr[tefpr < 0.00167]
r50crv = np.array([r50tefpr, tetpr[:r50tefpr.shape[0]]])
print "R50 value for Test data {0}".format(sum(r50crv[1,:])/len(r50crv[1,:]))
Or up until reaching a false positive rate of 0.000652%:
In [175]:
r50tefpr = tefpr[tefpr < 0.00000652]
r50crv = np.array([r50tefpr, tetpr[:r50tefpr.shape[0]]])
print "R50 value for Test data {0}".format(sum(r50crv[1,:])/len(r50crv[1,:]))
Interestingly, the first value of 0.631 compares well with the performance of the classifiers in the paper by Qi et al (2006).
Here, we are using a network of pulldown proteins created by merging the known interactions from three sources when queried with a list of proteins returned by a pulldown experiment. These three sources were:
This list of interactions was then hand-curated by removing some proteins with extremely high numbers of connections to aid the community detection algorithms. The resulting list of interactions defines the protein-protein interaction network of interest. Adding weights to this list of edges is a key part of this project. This will be done using the classifier trained above and the feature vectors created for this list of interactions.
In [177]:
# using loadtxt now as the file is small
pulldownvectors = np.loadtxt("features/edgelist.proto.vectors.txt")
In [227]:
cd ..
In [228]:
pulldownestimates = randomforest.predict_proba(pulldownvectors)
In [229]:
# write the results to a file
# with the associated protein pairs
f, fr = open("features/edgelist.proto.predictions.txt", "w"), open("forGAVIN/mergecode/edgelist_new.txt")
c, cr = csv.writer(f, delimiter="\t"), csv.reader(fr, delimiter="\t")
lcount = 0
for line in cr:
    c.writerow(line + ["%.5f" % pulldownestimates[lcount, 1]])
    lcount += 1
f.close()
fr.close()
Unfortunately, because there was a header in the edgelist file, the first "interaction" is between the string "columns" and the number 2. This must be removed:
In [230]:
!head -n 1 features/edgelist.proto.predictions.txt
In [231]:
!tail -n +2 features/edgelist.proto.predictions.txt > features/edgelist.proto.predictions.txt.tmp
In [232]:
!mv features/edgelist.proto.predictions.txt.tmp features/edgelist.proto.predictions.txt
The next step is to apply a community detection algorithm to both the weighted and unweighted edge lists. This algorithm has already been written in C++, so all that needs to be done is to execute it as in the notes on community detection.
First, it must be compiled:
In [10]:
cd HBP/
In [201]:
!make
In [206]:
!./run -file ../features/edgelist.proto.predictions.txt -weighted 0 -algorithm 2 -quite 1
The output is always written to the OUT directory, so it is moved before running again.
In [207]:
!mv OUT protounweighted
In [13]:
!mkdir OUT
In [14]:
!./run -file ../features/edgelist.proto.predictions.txt -weighted 1 -algorithm 3 -quite 1
In [15]:
!mv OUT protoweighted
The above warning may be a problem: Warning: Too many iterations in tqli. This will have to be asked about.
The results using weighted and unweighted edges can be compared in terms of their similarity or in terms of their disease enrichment results. The similarity between the results can be summarised by Normalised Mutual Information (NMI), which returns a value between 0 and 1 indicating how similar the two sets of communities are. Results of the disease enrichment test require interpretation by eye.
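For reference, the form of normalised mutual information that the function defined below appears to implement is (with $n_{ij}$ the number of nodes placed in community $i$ by one method and community $j$ by the other, $n_i$ and $n_j$ the corresponding community sizes, and $N$ the total number of nodes):

$$\mathrm{NMI} = \frac{-2\sum_{i}\sum_{j} n_{ij}\log\!\left(\frac{n_{ij}\,N}{n_i\,n_j}\right)}{\sum_{i} n_i\log\!\left(\frac{n_i}{N}\right) + \sum_{j} n_j\log\!\left(\frac{n_j}{N}\right)}$$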
Unfortunately, there are problems running the code below at the moment; this needs looking into.
When running the above community detection code, the communities found are compared using a disease enrichment test which indicates the likelihood that a given community is involved in a given disease. By default this is targeted at a specific set of markers for Schizophrenia. Looking at the differences between these two results will take place in a more detailed treatment of the community detection part of this project, and will probably be found in the community detection notebook.
In [1]:
from __future__ import division
import collections
import math
In [2]:
#--- helper function
def convertStr(s):
    ret = int(s)
    return ret
In [22]:
import pdb
In [23]:
def NMI(fname1,fname2):
    rows, new, dic = [], [], []
    consensus = fname1
    community = fname2
    #--- read in first community file
    with open(consensus, 'r') as csvfile:
        reader = csv.reader(csvfile, delimiter='=')
        headerline = reader.next()
        for row in reader:
            new.append(row)
    #--- read in second community file
    with open(community, 'r') as csvfile:
        reader = csv.reader(csvfile, delimiter='=')
        headerline = reader.next()
        for row in reader:
            dic.append(row)
    for i in dic:
        for j in new:
            if i[0] == j[0]:
                i.append(j[1])
    pdb.set_trace()
    dic.sort(key=lambda x: int(x[2]))
    #--- count of nodes in each community in reality and in algorithm
    realc, algc = [], []
    for i in dic:
        realc.append(convertStr(i[1]))
        algc.append(convertStr(i[2]))
    comcount = dict([(i,realc.count(i)) for i in set(realc)])
    pcount = dict([(i,algc.count(i)) for i in set(algc)])
    #--- holds the algorithm community label with the corresponding community labels
    #    provided e.g., {1:{2:2,3:3}}
    algcount, tempd = {}, {}
    templist = []
    comtrack = 0
    for key, val in pcount.iteritems():
        templist = realc[comtrack:comtrack+val]
        comtrack += val
        for i in set(templist):
            tempd[i] = templist.count(i)
        algcount[key] = tempd
        tempd = {}
    #--- finding the node with the maximum label in that community
    #    key = int, value = dictionary
    label = {}
    for key, value in algcount.iteritems():
        label[key] = max(value, key=value.get)
    for i in dic:
        for j in label:
            if int(i[2]) == j:
                i.append(label[j])
    #--- METRICS
    #--- Calculate the NMI
    nt, np, ntp = {}, {}, {}
    NMI = 0
    nodes = len(dic)
    for i in range(1, max(comcount)+1):
        for j in range(1, max(pcount)+1):
            nt[i] = 0
            np[j] = 0
            ntp[(str(i)+' '+str(j))] = 0
    nt = comcount
    np = pcount
    for i in dic:
        ntp[(str(i[1]).strip()+' '+str(i[2]).strip())] += 1
    num, den = 0, 0
    for i, t in nt.iteritems():
        for j, p in np.iteritems():
            temp, temp2 = 0, 0
            if (ntp[(str(i).strip()+' '+str(j).strip())] > 0) and (t > 0) and (p > 0):
                temp = ((ntp[(str(i).strip()+' '+str(j).strip())]*nodes)/(t*p))
                temp2 = ntp[(str(i).strip()+' '+str(j).strip())]*math.log(temp)
            num += temp2
    d1, d2 = 0, 0
    for t in nt.values():
        d1 += t * math.log(t/nodes)
    for p in np.values():
        d2 += p * math.log(p/nodes)
    den = d1 + d2
    NMI = -(2*num)/den
    #--- print results to screen
    print 'NMI = %s' % (NMI)
    f = open('OUT/community-metric_nmi.txt', 'w')
    f.writelines("node\treal\talg\tnewlabel\n")
    f.writelines("%s\t%s\t%s\t%s\n" % (i[0],i[1],i[2],i[3]) for i in dic)
    f.writelines('NMI = %s' % (NMI))
    f.close()
    return None